Project Topic: Coronavirus Disease (COVID-19) Data Analysis

Project Description

Overview

Within months, COVID-19 went from an epidemic to a pandemic, affecting people's lives all over the world. From the first identified case in December 2019, how did the virus spread so fast and so widely?
Out of curiosity, I explored the open dataset from the World Health Organization website and applied the analysis skills I have learned to this ongoing real-world problem.

Data Availability

The main dataset was acquired from the World Health Organization website and covers dates from January 11, 2020, to June 20, 2020.

Project Workflow

First, I used Python with the NumPy and pandas packages for data wrangling, including cleaning and merging datasets. For exploratory data analysis, I used several visualization tools, including Matplotlib and seaborn. I also taught myself Plotly Express, a high-level interface for creating interactive and informative visualizations.

Second, I integrated the Jupyter notebook with MySQL Workbench through a database API to migrate the data from a pandas DataFrame into a SQL table.

Finally, I connected the MySQL server to Tableau, built a dashboard, and deployed it to a public website to offer interactive data visualization.
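The DataFrame-to-SQL migration step can be sketched with pandas' `to_sql` on a SQLAlchemy engine. The sketch below uses an in-memory SQLite engine as a stand-in for the project's MySQL connection; the MySQL URL shown in the comment (driver, credentials, database name) is an assumption, not taken from the notebook:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite engine as a stand-in; the project itself connected to
# MySQL, with a URL of the form "mysql+pymysql://user:pass@host/db"
# (driver and credentials here are assumptions, not from the notebook).
engine = create_engine("sqlite://")

# Sample rows drawn from figures that appear later in this analysis
summary = pd.DataFrame({
    "Country": ["Afghanistan", "Taiwan"],
    "Cumulative_cases": [27878, 446],
})

# Migrate the DataFrame into a SQL table, then read it back to verify
summary.to_sql("covid_summary", engine, index=False, if_exists="replace")
round_trip = pd.read_sql("SELECT * FROM covid_summary", engine)
print(len(round_trip))  # 2
```

Swapping the engine URL for a real MySQL one is the only change needed to reproduce the migration step.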

In [1]:
# Import the relevant libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Suppress all warnings
import warnings
warnings.simplefilter(action='ignore')

# Expand the output display to see more columns of DataFrame
pd.set_option('display.max_columns', 50)

# print all the outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
In [2]:
# Load the data
df_WHO = pd.read_csv('WHO_COVID19_globaldata.csv')
In [3]:
# Eyeball the data
df_WHO.head()
Out[3]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths
0 2020-02-24T00:00:00Z AF Afghanistan EMRO 1 1 0 0
1 2020-02-25T00:00:00Z AF Afghanistan EMRO 0 1 0 0
2 2020-02-26T00:00:00Z AF Afghanistan EMRO 0 1 0 0
3 2020-02-27T00:00:00Z AF Afghanistan EMRO 0 1 0 0
4 2020-02-28T00:00:00Z AF Afghanistan EMRO 0 1 0 0
  • As we can see above, the data is recorded on a per-country basis.

Exploratory data analysis

In [4]:
# There are 23022 observations & 8 columns in this DataFrame 
df_WHO.shape
Out[4]:
(23022, 8)
In [5]:
df_WHO.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23022 entries, 0 to 23021
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Date_reported      23022 non-null  object
 1   Country_code       22923 non-null  object
 2   Country            23022 non-null  object
 3   WHO_region         22885 non-null  object
 4   New_cases          23022 non-null  int64 
 5   Cumulative_cases   23022 non-null  int64 
 6   New_deaths         23022 non-null  int64 
 7   Cumulative_deaths  23022 non-null  int64 
dtypes: int64(4), object(4)
memory usage: 1.4+ MB
In [6]:
# Convert Date_reported to datetime format 
df_WHO['Date_reported']=pd.to_datetime(df_WHO['Date_reported'])
In [7]:
# Check the column types again; Date_reported is now recorded in the UTC timezone 
df_WHO.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23022 entries, 0 to 23021
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   Date_reported      23022 non-null  datetime64[ns, UTC]
 1   Country_code       22923 non-null  object             
 2   Country            23022 non-null  object             
 3   WHO_region         22885 non-null  object             
 4   New_cases          23022 non-null  int64              
 5   Cumulative_cases   23022 non-null  int64              
 6   New_deaths         23022 non-null  int64              
 7   Cumulative_deaths  23022 non-null  int64              
dtypes: datetime64[ns, UTC](1), int64(4), object(3)
memory usage: 1.4+ MB
In [8]:
# Dates range from January 11, 2020, to June 20, 2020.
df_WHO.Date_reported.min(), df_WHO.Date_reported.max()
Out[8]:
(Timestamp('2020-01-11 00:00:00+0000', tz='UTC'),
 Timestamp('2020-06-20 00:00:00+0000', tz='UTC'))
In [9]:
df_WHO.describe(include ='all')
Out[9]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths
count 23022 22923 23022 22885 23022.000000 2.302200e+04 23022.000000 23022.000000
unique 162 215 216 6 NaN NaN NaN NaN
top 2020-05-23 00:00:00+00:00 CN China EURO NaN NaN NaN NaN
freq 216 162 162 6895 NaN NaN NaN NaN
first 2020-01-11 00:00:00+00:00 NaN NaN NaN NaN NaN NaN NaN
last 2020-06-20 00:00:00+00:00 NaN NaN NaN NaN NaN NaN NaN
mean NaN NaN NaN NaN 367.695465 1.534403e+04 19.731474 950.461645
std NaN NaN NaN NaN 2010.594357 9.137746e+04 132.386288 5866.605599
min NaN NaN NaN NaN -3525.000000 1.000000e+00 -1918.000000 0.000000
25% NaN NaN NaN NaN 0.000000 2.400000e+01 0.000000 0.000000
50% NaN NaN NaN NaN 5.000000 3.080000e+02 0.000000 6.000000
75% NaN NaN NaN NaN 71.000000 2.693750e+03 2.000000 60.000000
max NaN NaN NaN NaN 38509.000000 2.172212e+06 6409.000000 118205.000000
In [10]:
# Taiwan is excluded from the World Health Assembly, so it is absent from this dataset
df_WHO[df_WHO['Country'].str.startswith('Taiwan')]
Out[10]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths

Missing data Exploration

In [11]:
df_WHO.isna().any()
Out[11]:
Date_reported        False
Country_code          True
Country              False
WHO_region            True
New_cases            False
Cumulative_cases     False
New_deaths           False
Cumulative_deaths    False
dtype: bool
In [12]:
df_WHO.isna().sum()
Out[12]:
Date_reported          0
Country_code          99
Country                0
WHO_region           137
New_cases              0
Cumulative_cases       0
New_deaths             0
Cumulative_deaths      0
dtype: int64
  • There are two variables "Country_code" & "WHO_region" with missing values.
In [13]:
# The missing Country_code values all belong to Namibia
df_WHO[df_WHO.Country_code.isna() == True]['Country'].nunique()
df_WHO[df_WHO.Country_code.isna() == True].head(3)
Out[13]:
1
Out[13]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths
14449 2020-03-14 00:00:00+00:00 NaN Namibia AFRO 2 2 0 0
14450 2020-03-15 00:00:00+00:00 NaN Namibia AFRO 0 2 0 0
14451 2020-03-16 00:00:00+00:00 NaN Namibia AFRO 0 2 0 0
In [14]:
# Rows with missing WHO_region all have an undefined Country; they will be removed later
df_WHO[df_WHO.WHO_region.isna() == True].head(3)
Out[14]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths
15865 2020-02-05 00:00:00+00:00 Other* undefined NaN 10 10 0 0
15866 2020-02-06 00:00:00+00:00 Other* undefined NaN 10 20 0 0
15867 2020-02-07 00:00:00+00:00 Other* undefined NaN 41 61 0 0

Six WHO regions

The World Health Organization (WHO) divides the world into six WHO regions, for the purposes of reporting, analysis and administration.

  • African Region (AFRO)
  • Region of the Americas (AMRO/PAHO)
  • South-East Asia Region (SEARO)
  • European Region (EURO)
  • Eastern Mediterranean Region (EMRO)
  • Western Pacific Region (WPRO)

source: https://en.wikipedia.org/wiki/WHO_regions
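For labeling charts later on, the six region codes can be expanded to full names with a small lookup table. This is a convenience mapping built from the list above, not part of the WHO dataset itself; note the dataset encodes the Americas as AMRO:

```python
# Full names for the six WHO region codes used in this dataset
WHO_REGIONS = {
    "AFRO": "African Region",
    "AMRO": "Region of the Americas",
    "SEARO": "South-East Asia Region",
    "EURO": "European Region",
    "EMRO": "Eastern Mediterranean Region",
    "WPRO": "Western Pacific Region",
}

print(WHO_REGIONS["WPRO"])  # Western Pacific Region
```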

Handle Missing Data

In [15]:
# Namibia's ISO code is the literal string 'NA'; restore it
df_WHO.Country_code.fillna('NA',inplace=True)
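The gaps in Country_code are a CSV-parsing artifact: Namibia's ISO 3166-1 alpha-2 code is the literal string "NA", which pandas' `read_csv` treats as a missing-value marker by default. A minimal sketch of the behavior and the `keep_default_na` workaround:

```python
import io

import pandas as pd

# Namibia's country code is the literal string "NA", which read_csv
# interprets as a missing value by default
csv_data = "Country_code,Country\nNA,Namibia\nAF,Afghanistan"

default_read = pd.read_csv(io.StringIO(csv_data))
print(default_read["Country_code"].isna().sum())  # 1 -- "NA" became NaN

# keep_default_na=False preserves the literal string instead
safe_read = pd.read_csv(io.StringIO(csv_data), keep_default_na=False)
print(safe_read["Country_code"].isna().sum())  # 0
```

Reading the file with `keep_default_na=False` (or a narrowed `na_values` list) would avoid the need for the `fillna` step entirely.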
In [16]:
df_WHO.isna().sum()
Out[16]:
Date_reported          0
Country_code           0
Country                0
WHO_region           137
New_cases              0
Cumulative_cases       0
New_deaths             0
Cumulative_deaths      0
dtype: int64
In [17]:
# Drop the rows with WHO_region = NaN
df_WHO.dropna(how='any',subset=['WHO_region'],inplace=True)
In [18]:
# Check: the first round of data cleaning is now complete
df_WHO.isna().sum()
Out[18]:
Date_reported        0
Country_code         0
Country              0
WHO_region           0
New_cases            0
Cumulative_cases     0
New_deaths           0
Cumulative_deaths    0
dtype: int64

Manually add Taiwan COVID-19 data

In [19]:
# Load the data (This dataset includes Taiwan)
df = pd.read_csv('covid-data.csv')
In [20]:
df.head()
Out[20]:
iso_code continent location date total_cases new_cases total_deaths new_deaths total_cases_per_million new_cases_per_million total_deaths_per_million new_deaths_per_million total_tests new_tests total_tests_per_thousand new_tests_per_thousand new_tests_smoothed new_tests_smoothed_per_thousand tests_units stringency_index population population_density median_age aged_65_older aged_70_older gdp_per_capita extreme_poverty cvd_death_rate diabetes_prevalence female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy
0 AFG Asia Afghanistan 2019-12-31 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN 38928341 54.422 18.6 2.581 1.337 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83
1 AFG Asia Afghanistan 2020-01-01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 0.0 38928341 54.422 18.6 2.581 1.337 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83
2 AFG Asia Afghanistan 2020-01-02 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 0.0 38928341 54.422 18.6 2.581 1.337 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83
3 AFG Asia Afghanistan 2020-01-03 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 0.0 38928341 54.422 18.6 2.581 1.337 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83
4 AFG Asia Afghanistan 2020-01-04 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 0.0 38928341 54.422 18.6 2.581 1.337 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83
In [21]:
# Extract Taiwan observations only
df_TW = df[df['location'] == 'Taiwan']
df_TW.head()
Out[21]:
iso_code continent location date total_cases new_cases total_deaths new_deaths total_cases_per_million new_cases_per_million total_deaths_per_million new_deaths_per_million total_tests new_tests total_tests_per_thousand new_tests_per_thousand new_tests_smoothed new_tests_smoothed_per_thousand tests_units stringency_index population population_density median_age aged_65_older aged_70_older gdp_per_capita extreme_poverty cvd_death_rate diabetes_prevalence female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy
21838 TWN Asia Taiwan 2019-12-31 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN 23816775 NaN 42.2 NaN 8.353 NaN NaN 103.957 NaN NaN NaN NaN NaN 80.46
21839 TWN Asia Taiwan 2020-01-01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 0.00 23816775 NaN 42.2 NaN 8.353 NaN NaN 103.957 NaN NaN NaN NaN NaN 80.46
21840 TWN Asia Taiwan 2020-01-02 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 5.56 23816775 NaN 42.2 NaN 8.353 NaN NaN 103.957 NaN NaN NaN NaN NaN 80.46
21841 TWN Asia Taiwan 2020-01-03 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 5.56 23816775 NaN 42.2 NaN 8.353 NaN NaN 103.957 NaN NaN NaN NaN NaN 80.46
21842 TWN Asia Taiwan 2020-01-04 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 5.56 23816775 NaN 42.2 NaN 8.353 NaN NaN 103.957 NaN NaN NaN NaN NaN 80.46
In [23]:
# There are many unnecessary columns; they will be removed later 
df_TW.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 171 entries, 21838 to 22008
Data columns (total 34 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   iso_code                         171 non-null    object 
 1   continent                        171 non-null    object 
 2   location                         171 non-null    object 
 3   date                             171 non-null    object 
 4   total_cases                      170 non-null    float64
 5   new_cases                        170 non-null    float64
 6   total_deaths                     170 non-null    float64
 7   new_deaths                       170 non-null    float64
 8   total_cases_per_million          170 non-null    float64
 9   new_cases_per_million            170 non-null    float64
 10  total_deaths_per_million         170 non-null    float64
 11  new_deaths_per_million           170 non-null    float64
 12  total_tests                      94 non-null     float64
 13  new_tests                        90 non-null     float64
 14  total_tests_per_thousand         94 non-null     float64
 15  new_tests_per_thousand           90 non-null     float64
 16  new_tests_smoothed               96 non-null     float64
 17  new_tests_smoothed_per_thousand  96 non-null     float64
 18  tests_units                      103 non-null    object 
 19  stringency_index                 166 non-null    float64
 20  population                       171 non-null    int64  
 21  population_density               0 non-null      float64
 22  median_age                       171 non-null    float64
 23  aged_65_older                    0 non-null      float64
 24  aged_70_older                    171 non-null    float64
 25  gdp_per_capita                   0 non-null      float64
 26  extreme_poverty                  0 non-null      float64
 27  cvd_death_rate                   171 non-null    float64
 28  diabetes_prevalence              0 non-null      float64
 29  female_smokers                   0 non-null      float64
 30  male_smokers                     0 non-null      float64
 31  handwashing_facilities           0 non-null      float64
 32  hospital_beds_per_thousand       0 non-null      float64
 33  life_expectancy                  171 non-null    float64
dtypes: float64(28), int64(1), object(5)
memory usage: 46.8+ KB
In [24]:
# Keep only the fields shared with the df_WHO DataFrame

df_TW = df_TW[['date','iso_code','location','new_cases',\
               'total_cases','new_deaths','total_deaths']]
In [25]:
df_TW.head()
Out[25]:
date iso_code location new_cases total_cases new_deaths total_deaths
21838 2019-12-31 TWN Taiwan 0.0 0.0 0.0 0.0
21839 2020-01-01 TWN Taiwan 0.0 0.0 0.0 0.0
21840 2020-01-02 TWN Taiwan 0.0 0.0 0.0 0.0
21841 2020-01-03 TWN Taiwan 0.0 0.0 0.0 0.0
21842 2020-01-04 TWN Taiwan 0.0 0.0 0.0 0.0
In [26]:
# Taiwan belongs to the WPRO region, so add a WHO_region column to df_TW
df_TW.insert(3,'WHO_region','WPRO')
In [27]:
df_TW.head(5)
Out[27]:
date iso_code location WHO_region new_cases total_cases new_deaths total_deaths
21838 2019-12-31 TWN Taiwan WPRO 0.0 0.0 0.0 0.0
21839 2020-01-01 TWN Taiwan WPRO 0.0 0.0 0.0 0.0
21840 2020-01-02 TWN Taiwan WPRO 0.0 0.0 0.0 0.0
21841 2020-01-03 TWN Taiwan WPRO 0.0 0.0 0.0 0.0
21842 2020-01-04 TWN Taiwan WPRO 0.0 0.0 0.0 0.0
In [28]:
# Convert the date column to datetime (UTC timezone)
df_TW['date']=pd.to_datetime(df_TW['date']).dt.tz_localize('UTC')
In [29]:
df_TW.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 171 entries, 21838 to 22008
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype              
---  ------        --------------  -----              
 0   date          171 non-null    datetime64[ns, UTC]
 1   iso_code      171 non-null    object             
 2   location      171 non-null    object             
 3   WHO_region    171 non-null    object             
 4   new_cases     170 non-null    float64            
 5   total_cases   170 non-null    float64            
 6   new_deaths    170 non-null    float64            
 7   total_deaths  170 non-null    float64            
dtypes: datetime64[ns, UTC](1), float64(4), object(3)
memory usage: 12.0+ KB
In [30]:
df_TW['date'].min(), df_TW['date'].max()
Out[30]:
(Timestamp('2019-12-31 00:00:00+0000', tz='UTC'),
 Timestamp('2020-06-19 00:00:00+0000', tz='UTC'))
In [31]:
df_TW.head()
Out[31]:
date iso_code location WHO_region new_cases total_cases new_deaths total_deaths
21838 2019-12-31 00:00:00+00:00 TWN Taiwan WPRO 0.0 0.0 0.0 0.0
21839 2020-01-01 00:00:00+00:00 TWN Taiwan WPRO 0.0 0.0 0.0 0.0
21840 2020-01-02 00:00:00+00:00 TWN Taiwan WPRO 0.0 0.0 0.0 0.0
21841 2020-01-03 00:00:00+00:00 TWN Taiwan WPRO 0.0 0.0 0.0 0.0
21842 2020-01-04 00:00:00+00:00 TWN Taiwan WPRO 0.0 0.0 0.0 0.0
In [32]:
df_WHO.columns.tolist()
Out[32]:
['Date_reported',
 'Country_code',
 'Country',
 'WHO_region',
 'New_cases',
 'Cumulative_cases',
 'New_deaths',
 'Cumulative_deaths']
In [33]:
# Rename df_TW columns to be consistent with df_WHO
df_TW.columns = df_WHO.columns.tolist()
df_TW.head()
Out[33]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths
21838 2019-12-31 00:00:00+00:00 TWN Taiwan WPRO 0.0 0.0 0.0 0.0
21839 2020-01-01 00:00:00+00:00 TWN Taiwan WPRO 0.0 0.0 0.0 0.0
21840 2020-01-02 00:00:00+00:00 TWN Taiwan WPRO 0.0 0.0 0.0 0.0
21841 2020-01-03 00:00:00+00:00 TWN Taiwan WPRO 0.0 0.0 0.0 0.0
21842 2020-01-04 00:00:00+00:00 TWN Taiwan WPRO 0.0 0.0 0.0 0.0
In [34]:
# Remove dates with 0 Cumulative_cases
df_TW = df_TW[df_TW['Cumulative_cases'] != 0]
In [35]:
# Reset index starting from 0 
df_TW.index = range(len(df_TW))
In [36]:
df_TW.head()
Out[36]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths
0 2020-01-21 00:00:00+00:00 TWN Taiwan WPRO 1.0 1.0 0.0 0.0
1 2020-01-22 00:00:00+00:00 TWN Taiwan WPRO 0.0 1.0 0.0 0.0
2 2020-01-23 00:00:00+00:00 TWN Taiwan WPRO 0.0 1.0 0.0 0.0
3 2020-01-24 00:00:00+00:00 TWN Taiwan WPRO 0.0 1.0 0.0 0.0
4 2020-01-25 00:00:00+00:00 TWN Taiwan WPRO 2.0 3.0 0.0 0.0

Missing Data Exploration

In [37]:
df_TW.isna().sum()
Out[37]:
Date_reported        0
Country_code         0
Country              0
WHO_region           0
New_cases            1
Cumulative_cases     1
New_deaths           1
Cumulative_deaths    1
dtype: int64
In [39]:
# Missing values are all from the same record 
df_TW[df_TW['New_cases'].isna()]
Out[39]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths
48 2020-03-10 00:00:00+00:00 TWN Taiwan WPRO NaN NaN NaN NaN
In [40]:
# Inspect the records around the missing row using the positional index 
df_TW.iloc[45:50,]
Out[40]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths
45 2020-03-07 00:00:00+00:00 TWN Taiwan WPRO 1.0 45.0 0.0 1.0
46 2020-03-08 00:00:00+00:00 TWN Taiwan WPRO 0.0 45.0 0.0 1.0
47 2020-03-09 00:00:00+00:00 TWN Taiwan WPRO 0.0 45.0 0.0 1.0
48 2020-03-10 00:00:00+00:00 TWN Taiwan WPRO NaN NaN NaN NaN
49 2020-03-11 00:00:00+00:00 TWN Taiwan WPRO 3.0 48.0 0.0 1.0
In [41]:
# Forward-fill from the previous record to complete the row 
df_TW.fillna(method='ffill',inplace = True)
In [42]:
# Check it again 
df_TW.iloc[45:50,]
Out[42]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths
45 2020-03-07 00:00:00+00:00 TWN Taiwan WPRO 1.0 45.0 0.0 1.0
46 2020-03-08 00:00:00+00:00 TWN Taiwan WPRO 0.0 45.0 0.0 1.0
47 2020-03-09 00:00:00+00:00 TWN Taiwan WPRO 0.0 45.0 0.0 1.0
48 2020-03-10 00:00:00+00:00 TWN Taiwan WPRO 0.0 45.0 0.0 1.0
49 2020-03-11 00:00:00+00:00 TWN Taiwan WPRO 3.0 48.0 0.0 1.0
In [43]:
# Check: the second round of data cleaning is now complete
df_TW.isna().sum()
Out[43]:
Date_reported        0
Country_code         0
Country              0
WHO_region           0
New_cases            0
Cumulative_cases     0
New_deaths           0
Cumulative_deaths    0
dtype: int64

Combine df_WHO & df_TW into "df_World"

In [44]:
df_World = pd.concat([df_WHO, df_TW], axis = 0)
In [45]:
df_World.head()
Out[45]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths
0 2020-02-24 00:00:00+00:00 AF Afghanistan EMRO 1.0 1.0 0.0 0.0
1 2020-02-25 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0
2 2020-02-26 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0
3 2020-02-27 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0
4 2020-02-28 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0
In [46]:
df_World.tail()
Out[46]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths
145 2020-06-15 00:00:00+00:00 TWN Taiwan WPRO 0.0 443.0 0.0 7.0
146 2020-06-16 00:00:00+00:00 TWN Taiwan WPRO 2.0 445.0 0.0 7.0
147 2020-06-17 00:00:00+00:00 TWN Taiwan WPRO 0.0 445.0 0.0 7.0
148 2020-06-18 00:00:00+00:00 TWN Taiwan WPRO 0.0 445.0 0.0 7.0
149 2020-06-19 00:00:00+00:00 TWN Taiwan WPRO 1.0 446.0 0.0 7.0
In [47]:
# Reset index starting from 0 
df_World.index = range(len(df_World))
df_World.tail()
Out[47]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths
23030 2020-06-15 00:00:00+00:00 TWN Taiwan WPRO 0.0 443.0 0.0 7.0
23031 2020-06-16 00:00:00+00:00 TWN Taiwan WPRO 2.0 445.0 0.0 7.0
23032 2020-06-17 00:00:00+00:00 TWN Taiwan WPRO 0.0 445.0 0.0 7.0
23033 2020-06-18 00:00:00+00:00 TWN Taiwan WPRO 0.0 445.0 0.0 7.0
23034 2020-06-19 00:00:00+00:00 TWN Taiwan WPRO 1.0 446.0 0.0 7.0
In [48]:
df_World.shape
Out[48]:
(23035, 8)
In [49]:
df_World.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23035 entries, 0 to 23034
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   Date_reported      23035 non-null  datetime64[ns, UTC]
 1   Country_code       23035 non-null  object             
 2   Country            23035 non-null  object             
 3   WHO_region         23035 non-null  object             
 4   New_cases          23035 non-null  float64            
 5   Cumulative_cases   23035 non-null  float64            
 6   New_deaths         23035 non-null  float64            
 7   Cumulative_deaths  23035 non-null  float64            
dtypes: datetime64[ns, UTC](1), float64(4), object(3)
memory usage: 1.4+ MB
In [50]:
df_World.describe(include ='all')
Out[50]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths
count 23035 23035 23035 23035 23035.000000 2.303500e+04 23035.000000 23035.000000
unique 162 216 216 6 NaN NaN NaN NaN
top 2020-05-23 00:00:00+00:00 CN China EURO NaN NaN NaN NaN
freq 216 162 162 6895 NaN NaN NaN NaN
first 2020-01-11 00:00:00+00:00 NaN NaN NaN NaN NaN NaN NaN
last 2020-06-20 00:00:00+00:00 NaN NaN NaN NaN NaN NaN NaN
mean NaN NaN NaN NaN 367.475147 1.533309e+04 19.720078 949.893466
std NaN NaN NaN NaN 2010.047781 9.135277e+04 132.349792 5864.998404
min NaN NaN NaN NaN -3525.000000 1.000000e+00 -1918.000000 0.000000
25% NaN NaN NaN NaN 0.000000 2.400000e+01 0.000000 0.000000
50% NaN NaN NaN NaN 5.000000 3.020000e+02 0.000000 6.000000
75% NaN NaN NaN NaN 71.000000 2.690000e+03 2.000000 60.000000
max NaN NaN NaN NaN 38509.000000 2.172212e+06 6409.000000 118205.000000
  • Now we are ready to do further analysis with this cleaned data

Data Overview

  • The earliest day with a case recorded by the WHO is 2020-01-11
In [51]:
df_World.Date_reported.min()
Out[51]:
Timestamp('2020-01-11 00:00:00+0000', tz='UTC')
  • The earliest recorded cases occurred in China
In [52]:
df_World[(df_World.Date_reported == df_World.Date_reported.min())]
Out[52]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths
4337 2020-01-11 00:00:00+00:00 CN China WPRO 41.0 41.0 1.0 1.0
  • Check how many total days this data covers
In [53]:
df_World.Date_reported.nunique()
Out[53]:
162
  • Check how many countries had confirmed cases by the end of January
In [54]:
df_World[(df_World.Date_reported < '2020-02-01') & (df_World.Cumulative_cases != 0)]\
['Country'].nunique()
Out[54]:
25
  • Explore how many countries in total had confirmed cases
In [55]:
df_World['Country'].nunique()
Out[55]:
216
  • Explore the earliest date on which a total of 10 countries had reported deaths
In [56]:
df_death = df_World[df_World.Cumulative_deaths != 0]
df_death.head()
Out[56]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths
28 2020-03-23 00:00:00+00:00 AF Afghanistan EMRO 6.0 40.0 1.0 1.0
29 2020-03-24 00:00:00+00:00 AF Afghanistan EMRO 2.0 42.0 0.0 1.0
30 2020-03-25 00:00:00+00:00 AF Afghanistan EMRO 32.0 74.0 0.0 1.0
31 2020-03-26 00:00:00+00:00 AF Afghanistan EMRO 6.0 80.0 1.0 2.0
32 2020-03-27 00:00:00+00:00 AF Afghanistan EMRO 11.0 91.0 0.0 2.0
In [57]:
Confirmed_death = df_death.groupby(by = 'Date_reported')['Country'].nunique()
Confirmed_death
Out[57]:
Date_reported
2020-01-11 00:00:00+00:00      1
2020-01-12 00:00:00+00:00      1
2020-01-13 00:00:00+00:00      1
2020-01-14 00:00:00+00:00      1
2020-01-15 00:00:00+00:00      1
                            ... 
2020-06-16 00:00:00+00:00    187
2020-06-17 00:00:00+00:00    187
2020-06-18 00:00:00+00:00    187
2020-06-19 00:00:00+00:00    187
2020-06-20 00:00:00+00:00    186
Name: Country, Length: 162, dtype: int64
In [58]:
Confirmed_death[Confirmed_death == 10].index[0]
Out[58]:
Timestamp('2020-03-01 00:00:00+0000', tz='UTC')
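One caveat: filtering with `Confirmed_death == 10` only works because the daily country count happened to hit exactly 10 on some date; had it jumped from 9 straight to 11, the exact match would return an empty result. Comparing against a threshold is more robust, as this sketch on a synthetic series shows:

```python
import pandas as pd

# Synthetic daily country counts that skip the value 10 entirely
counts = pd.Series([3, 7, 9, 11, 15],
                   index=pd.date_range("2020-02-26", periods=5))

# An exact match finds nothing...
exact = counts[counts == 10]
print(len(exact))  # 0

# ...while a threshold comparison still finds the first qualifying date
first_date = counts[counts >= 10].index[0]
print(first_date.date())  # 2020-02-29
```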

Regional-based analysis

  • Considering all confirmed cases, explore the cumulative cases and cumulative deaths in each of the six WHO regions
In [59]:
df_region = df_World.groupby('WHO_region')[['New_cases','New_deaths']].sum().reset_index().\
rename(columns={'New_cases':'Cumulative_cases','New_deaths':'Cumulative_deaths'})

df_region.head()
Out[59]:
WHO_region Cumulative_cases Cumulative_deaths
0 AFRO 208535.0 4750.0
1 AMRO 4163813.0 215903.0
2 EMRO 856650.0 19041.0
3 EURO 2490815.0 190903.0
4 SEARO 541041.0 16360.0
  • Calculate Case Fatality Rate (CFR)

The Case Fatality Rate (CFR) is the ratio between confirmed deaths and confirmed cases.
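As a quick sanity check of the formula, the EURO totals from the regional summary above reproduce the same CFR value:

```python
# Spot check: EURO totals taken from the regional summary above
euro_cases = 2490815
euro_deaths = 190903

# CFR = confirmed deaths / confirmed cases, rounded to 3 decimals
cfr = round(euro_deaths / euro_cases, 3)
print(cfr)  # 0.077
```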

In [60]:
df_region['CFR']= round((df_region['Cumulative_deaths']/df_region['Cumulative_cases']),3)
df_region.sort_values('CFR', ascending = False)
Out[60]:
WHO_region Cumulative_cases Cumulative_deaths CFR
3 EURO 2490815.0 190903.0 0.077
1 AMRO 4163813.0 215903.0 0.052
5 WPRO 203936.0 7295.0 0.036
4 SEARO 541041.0 16360.0 0.030
0 AFRO 208535.0 4750.0 0.023
2 EMRO 856650.0 19041.0 0.022

Data Visualization - Plot the data

  • Seaborn
In [61]:
# Bar chart
sns.catplot(x='WHO_region', y='Cumulative_cases',data=df_region, kind='bar',aspect=1.5)
sns.catplot(x='WHO_region', y='Cumulative_deaths',data=df_region, kind='bar',aspect=1.5)
Out[61]:
<seaborn.axisgrid.FacetGrid at 0x11a907588>
Out[61]:
<seaborn.axisgrid.FacetGrid at 0x117517240>
  • Matplotlib
In [68]:
# pie chart

fig = plt.figure(2, figsize=(6,6))
labels=['AFRO', 'AMRO', 'EMRO', 'EURO','SEARO','WPRO']
explode=(0.1,0,0.1,0,0.1,0.1)
_ = plt.pie(df_region['Cumulative_cases'],explode=explode,labeldistance=1.35,pctdistance=1.2, \
        labels=labels,autopct='%1.1f%%',
       shadow=True,startangle=-40)
_ = plt.title('Cumulative_cases across 6 WHO Region')
plt.show()


fig = plt.figure(2, figsize=(6,6))
labels=['AFRO', 'AMRO', 'EMRO', 'EURO','SEARO','WPRO']
explode=(0.1,0,0.1,0,0.1,0.1)
_ = plt.pie(df_region['Cumulative_deaths'],explode=explode,labeldistance=1.35,pctdistance=1.2, \
        labels=labels,autopct='%1.1f%%',
       shadow=True,startangle=-40)
_ = plt.title('Cumulative_deaths across 6 WHO Region')
plt.show()
  • Plotly Express: a high-level interface to plotly.py that allows us to create many interactive and informative visualizations.
In [65]:
import plotly.express as px
In [66]:
# bar chart 
fig = px.bar(df_region, x="WHO_region", y="Cumulative_deaths", \
             color="WHO_region")
fig.show()
In [67]:
# pie chart 
fig = px.pie(df_region, values='Cumulative_deaths', names='WHO_region', \
             title='Cumulative_deaths across 6 WHO Region')
fig.show()

Country-based analysis

In [68]:
df_country = df_World.groupby(by = ['WHO_region','Country','Country_code'])\
[['New_cases','New_deaths']].sum()\
.rename(columns={'New_cases':'Cumulative_cases','New_deaths':'Cumulative_deaths'})
In [69]:
df_country.head()
Out[69]:
Cumulative_cases Cumulative_deaths
WHO_region Country Country_code
AFRO Algeria DZ 11504.0 825.0
Angola AO 166.0 8.0
Benin BJ 650.0 11.0
Botswana BW 89.0 1.0
Burkina Faso BF 901.0 53.0
In [70]:
# Alternatively
df_country = df_World.groupby(by = 'Country')[['Country_code','Cumulative_cases',\
                                            'Cumulative_deaths','WHO_region']].max()
In [71]:
df_country.head()
Out[71]:
Country_code Cumulative_cases Cumulative_deaths WHO_region
Country
Afghanistan AF 27878.0 548.0 EMRO
Albania AL 1838.0 42.0 EURO
Algeria DZ 11504.0 825.0 AFRO
Andorra AD 855.0 52.0 EURO
Angola AO 166.0 8.0 AFRO
  • Explore the top 5 countries with the most confirmed cases as of June 20, 2020
In [72]:
df_country.sort_values(by=['Cumulative_cases'], ascending = False).head()
Out[72]:
Country_code Cumulative_cases Cumulative_deaths WHO_region
Country
United States of America US 2172212.0 118205.0 AMRO
Brazil BR 978142.0 47748.0 AMRO
Russian Federation RU 569063.0 7841.0 EURO
India IN 380532.0 12573.0 SEARO
The United Kingdom GB 300473.0 42288.0 EURO
In [73]:
# Alternatively 
df_country['Cumulative_cases'].nlargest(5)
Out[73]:
Country
United States of America    2172212.0
Brazil                       978142.0
Russian Federation           569063.0
India                        380532.0
The United Kingdom           300473.0
Name: Cumulative_cases, dtype: float64
  • Explore the 5 countries with the fewest confirmed cases as of June 20, 2020
In [74]:
df_country['Cumulative_cases'].nsmallest(5)
Out[74]:
Country
Saint Pierre and Miquelon           1.0
Anguilla                            3.0
Lesotho                             4.0
Saint Barthélemy                    6.0
Bonaire, Sint Eustatius and Saba    7.0
Name: Cumulative_cases, dtype: float64
  • Explore the top 5 countries with the most confirmed deaths as of June 20, 2020
In [75]:
df_country['Cumulative_deaths'].nlargest(5)
Out[75]:
Country
United States of America    118205.0
Brazil                       47748.0
The United Kingdom           42288.0
Italy                        34514.0
France                       29537.0
Name: Cumulative_deaths, dtype: float64
In [76]:
# Calculate Case Fatality Rate (CFR)
df_country['CFR']= round((df_country['Cumulative_deaths']/df_country['Cumulative_cases']),3)
df_country.head()
Out[76]:
Country_code Cumulative_cases Cumulative_deaths WHO_region CFR
Country
Afghanistan AF 27878.0 548.0 EMRO 0.020
Albania AL 1838.0 42.0 EURO 0.023
Algeria DZ 11504.0 825.0 AFRO 0.072
Andorra AD 855.0 52.0 EURO 0.061
Angola AO 166.0 8.0 AFRO 0.048
  • Explore the top 5 countries with the highest CFR as of June 20, 2020
In [77]:
df_country.sort_values(by = 'CFR', ascending = False).head(5)
Out[77]:
Country_code Cumulative_cases Cumulative_deaths WHO_region CFR
Country
Yemen YE 913.0 249.0 EMRO 0.273
France FR 153557.0 29537.0 EURO 0.192
Sint Maarten SX 78.0 15.0 AMRO 0.192
Belgium BE 60348.0 9683.0 EURO 0.160
Italy IT 238159.0 34514.0 EURO 0.145
In [78]:
# In order to plot the data, reset the index
df_country = df_country.reset_index() 
df_country.head()
Out[78]:
Country Country_code Cumulative_cases Cumulative_deaths WHO_region CFR
0 Afghanistan AF 27878.0 548.0 EMRO 0.020
1 Albania AL 1838.0 42.0 EURO 0.023
2 Algeria DZ 11504.0 825.0 AFRO 0.072
3 Andorra AD 855.0 52.0 EURO 0.061
4 Angola AO 166.0 8.0 AFRO 0.048

Data Visualization - Plot the data

  • Plotly Express
In [79]:
# Scatter plot:
# There appears to be a positive relationship between 'Cumulative_deaths' and 'Cumulative_cases'

fig = px.scatter(df_country, x="Cumulative_deaths", y='Cumulative_cases',\
                 trendline="ols",hover_name="Country")
fig.show()
In [80]:
# Scatter plot: a larger circle means a higher CFR for that country
# The United States of America, at the top right, is an outlier 
fig = px.scatter(df_country, x="Cumulative_deaths", y='Cumulative_cases',\
                 size="CFR", color="WHO_region", hover_name="Country", size_max=30, \
                 template = 'plotly_dark')
fig.show()
In [81]:
# Treemap: a larger rectangle means more Cumulative_cases 
# The closer the color is to blue, the higher the CFR (Case Fatality Rate)
# As we can see, France has a high CFR. 

fig = px.treemap(df_country, path=['WHO_region', 'Country'], values='Cumulative_cases', \
                  color='CFR',hover_data=['Country'], color_continuous_scale='RdBu')
fig.show()
In [82]:
# Map
# Something is off: only Taiwan shows up, not any other country 
# Likely a Country_code problem (Taiwan is the only country whose Country_code is 3 letters)

fig = px.scatter_geo(df_country, locations="Country_code", color='CFR', hover_name="Country",\
             projection="natural earth")
fig.show()
  • In order to be consistent with other countries, I changed the iso_code from TWN (Alpha-3 code) to TW (Alpha-2 code), which will be used later.
In [83]:
df_country.replace('TWN', 'TW', inplace = True)
In [84]:
df_country[df_country.Country == 'Taiwan']
Out[84]:
Country Country_code Cumulative_cases Cumulative_deaths WHO_region CFR
192 Taiwan TW 446.0 7.0 WPRO 0.016

Next, I convert Country_code from the current 2-letter (Alpha-2) codes to 3-letter (Alpha-3) codes

In [85]:
# Load the data
iso_alpha = pd.read_csv('wikipedia_isocode.csv')
iso_alpha.head()
Out[85]:
English short name lower case Alpha-2 code Alpha-3 code Numeric code ISO 3166-2
0 Afghanistan AF AFG 4 ISO 3166-2:AF
1 Åland Islands AX ALA 248 ISO 3166-2:AX
2 Albania AL ALB 8 ISO 3166-2:AL
3 Algeria DZ DZA 12 ISO 3166-2:DZ
4 American Samoa AS ASM 16 ISO 3166-2:AS
In [86]:
# drop unnecessary columns
iso_alpha.drop(columns = ['English short name lower case',\
                          'Numeric code','ISO 3166-2'], inplace = True)
In [87]:
iso_alpha.shape
Out[87]:
(246, 2)
In [88]:
# Merge df_country & iso_alpha
df = df_country.merge(iso_alpha, left_on='Country_code',right_on='Alpha-2 code', how='left')
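As a side note, pandas can flag unmatched rows during the merge itself via the `indicator` parameter, which saves the manual column comparison. A minimal sketch on toy data (the frames here are hypothetical stand-ins for df_country and iso_alpha):

```python
import pandas as pd

# Toy stand-ins: Kosovo has no entry in the ISO table
left = pd.DataFrame({'Country_code': ['AF', 'XK'],
                     'Country': ['Afghanistan', 'Kosovo']})
right = pd.DataFrame({'Alpha-2 code': ['AF'], 'Alpha-3 code': ['AFG']})

# indicator=True adds a '_merge' column recording each row's origin
merged = left.merge(right, left_on='Country_code', right_on='Alpha-2 code',
                    how='left', indicator=True)

# Rows with no match in the ISO table are flagged 'left_only'
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched['Country'].tolist())  # ['Kosovo']
```

Rows flagged 'left_only' are exactly those whose Alpha-3 code would come back missing.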
In [89]:
df.head()
Out[89]:
Country Country_code Cumulative_cases Cumulative_deaths WHO_region CFR Alpha-2 code Alpha-3 code
0 Afghanistan AF 27878.0 548.0 EMRO 0.020 AF AFG
1 Albania AL 1838.0 42.0 EURO 0.023 AL ALB
2 Algeria DZ 11504.0 825.0 AFRO 0.072 DZ DZA
3 Andorra AD 855.0 52.0 EURO 0.061 AD AND
4 Angola AO 166.0 8.0 AFRO 0.048 AO AGO
In [90]:
# Find the records that were not matched: there are 6 in total
df['Alpha-3 code'].isna().sum()
Out[90]:
6
In [91]:
iso_missing = df[df['Country_code'] != df['Alpha-2 code']]
iso_missing
Out[91]:
Country Country_code Cumulative_cases Cumulative_deaths WHO_region CFR Alpha-2 code Alpha-3 code
24 Bonaire, Sint Eustatius and Saba BQ 7.0 0.0 AMRO 0.000 NaN NaN
48 Curacao CW 23.0 1.0 AMRO 0.043 NaN NaN
107 Kosovo[1] XK 1873.0 24.0 EURO 0.013 NaN NaN
137 Namibia NA 45.0 0.0 AFRO 0.000 NaN NaN
179 Sint Maarten SX 78.0 15.0 AMRO 0.192 NaN NaN
184 South Sudan SS 1830.0 32.0 AFRO 0.017 NaN NaN
  • After identifying the exact countries whose Alpha-3 code is missing, I add the codes manually

source: https://countrycode.org/
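Assigning the six codes as a positional list is fragile: it silently depends on the row order of iso_missing. A dictionary keyed on the Alpha-2 code makes the fix order-independent; a small sketch (the toy frame below stands in for iso_missing):

```python
import pandas as pd

# Explicit Alpha-2 -> Alpha-3 mapping for the six unmatched entries
manual_codes = {'BQ': 'BES', 'CW': 'CUW', 'XK': 'XKX',
                'NA': 'NAM', 'SX': 'SXM', 'SS': 'SSD'}

# Toy stand-in for iso_missing: only the key column matters here
iso_missing = pd.DataFrame({'Country_code': ['BQ', 'CW', 'XK', 'NA', 'SX', 'SS']})

# .map looks each code up by value, so row order is irrelevant
iso_missing['Alpha-3 code'] = iso_missing['Country_code'].map(manual_codes)
print(iso_missing)
```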

In [92]:
# The list must match the row order of iso_missing: BQ, CW, XK, NA, SX, SS
iso_missing['Alpha-3 code'] = ['BES', 'CUW', 'XKX', 'NAM', 'SXM', 'SSD']
In [93]:
iso_missing.head()
Out[93]:
Country Country_code Cumulative_cases Cumulative_deaths WHO_region CFR Alpha-2 code Alpha-3 code
24 Bonaire, Sint Eustatius and Saba BQ 7.0 0.0 AMRO 0.000 NaN BES
48 Curacao CW 23.0 1.0 AMRO 0.043 NaN CUW
107 Kosovo[1] XK 1873.0 24.0 EURO 0.013 NaN XKX
137 Namibia NA 45.0 0.0 AFRO 0.000 NaN NAM
179 Sint Maarten SX 78.0 15.0 AMRO 0.192 NaN SXM
In [94]:
# Extract those countries Alpha-3 code are matched 
df_noNA = df[df['Country_code'] == df['Alpha-2 code']] 
In [95]:
# Concatenate df_noNA & iso_missing
df = pd.concat([df_noNA, iso_missing], axis = 0)
In [96]:
df.tail()
Out[96]:
Country Country_code Cumulative_cases Cumulative_deaths WHO_region CFR Alpha-2 code Alpha-3 code
48 Curacao CW 23.0 1.0 AMRO 0.043 NaN CUW
107 Kosovo[1] XK 1873.0 24.0 EURO 0.013 NaN XKX
137 Namibia NA 45.0 0.0 AFRO 0.000 NaN NAM
179 Sint Maarten SX 78.0 15.0 AMRO 0.192 NaN SXM
184 South Sudan SS 1830.0 32.0 AFRO 0.017 NaN SSD
In [97]:
df['Alpha-3 code'].isna().sum()
Out[97]:
0
In [98]:
# Map: try again, and this time it renders correctly
# The closer the color is to yellow, the higher the CFR (Case Fatality Rate)
# As we can see, Yemen stands out. 

fig = px.choropleth(df, locations="Alpha-3 code", color='CFR', hover_name="Country", \
             projection="natural earth")
fig.show()

Time Series Analysis

Explore the number of new cases and new deaths over time to see whether the epidemic has stabilized or slowed down

In [99]:
df_World.head()
Out[99]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths
0 2020-02-24 00:00:00+00:00 AF Afghanistan EMRO 1.0 1.0 0.0 0.0
1 2020-02-25 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0
2 2020-02-26 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0
3 2020-02-27 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0
4 2020-02-28 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0
  • Explore new confirmed cases over time
In [100]:
df_date_case = df_World.groupby('Date_reported')['New_cases'].sum().reset_index()
In [101]:
df_date_case.head()
Out[101]:
Date_reported New_cases
0 2020-01-11 00:00:00+00:00 41.0
1 2020-01-12 00:00:00+00:00 0.0
2 2020-01-13 00:00:00+00:00 1.0
3 2020-01-14 00:00:00+00:00 1.0
4 2020-01-15 00:00:00+00:00 0.0
In [102]:
df_date_case['binned_Dates'] = pd.cut(df_date_case.Date_reported, 5)
df_date_case.head()
Out[102]:
Date_reported New_cases binned_Dates
0 2020-01-11 00:00:00+00:00 41.0 (2020-01-10 20:08:09.600000, 2020-02-12 04:48:00]
1 2020-01-12 00:00:00+00:00 0.0 (2020-01-10 20:08:09.600000, 2020-02-12 04:48:00]
2 2020-01-13 00:00:00+00:00 1.0 (2020-01-10 20:08:09.600000, 2020-02-12 04:48:00]
3 2020-01-14 00:00:00+00:00 1.0 (2020-01-10 20:08:09.600000, 2020-02-12 04:48:00]
4 2020-01-15 00:00:00+00:00 0.0 (2020-01-10 20:08:09.600000, 2020-02-12 04:48:00]
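Passing an integer to pd.cut splits the date range into that many equal-width intervals, which is what produces the bin labels above. A quick sketch on synthetic dates illustrates the behavior:

```python
import pandas as pd

# Ten consecutive days, binned into 5 equal-width intervals
dates = pd.Series(pd.date_range('2020-01-01', periods=10, freq='D'))
bins = pd.cut(dates, 5)

# Each interval spans the same amount of time, so here each
# consecutive pair of days lands in the same bin
print(bins.nunique())  # 5
```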

Data Visualization - Plot the data

  • Seaborn
In [103]:
sns.catplot(y='binned_Dates', x='New_cases', data=df_date_case, kind='bar',aspect=3)
Out[103]:
<seaborn.axisgrid.FacetGrid at 0x1a18534e48>
  • Explore new deaths over time
In [104]:
df_date_death = df_World.groupby('Date_reported')['New_deaths'].sum().reset_index()
df_date_death.head()
Out[104]:
Date_reported New_deaths
0 2020-01-11 00:00:00+00:00 1.0
1 2020-01-12 00:00:00+00:00 0.0
2 2020-01-13 00:00:00+00:00 0.0
3 2020-01-14 00:00:00+00:00 0.0
4 2020-01-15 00:00:00+00:00 0.0
In [105]:
df_date_death['binned_Dates'] = pd.cut(df_date_death.Date_reported, 5)
df_date_death.head()
Out[105]:
Date_reported New_deaths binned_Dates
0 2020-01-11 00:00:00+00:00 1.0 (2020-01-10 20:08:09.600000, 2020-02-12 04:48:00]
1 2020-01-12 00:00:00+00:00 0.0 (2020-01-10 20:08:09.600000, 2020-02-12 04:48:00]
2 2020-01-13 00:00:00+00:00 0.0 (2020-01-10 20:08:09.600000, 2020-02-12 04:48:00]
3 2020-01-14 00:00:00+00:00 0.0 (2020-01-10 20:08:09.600000, 2020-02-12 04:48:00]
4 2020-01-15 00:00:00+00:00 0.0 (2020-01-10 20:08:09.600000, 2020-02-12 04:48:00]
In [106]:
sns.catplot(y='binned_Dates', x='New_deaths', data=df_date_death, kind='bar',aspect=3)
Out[106]:
<seaborn.axisgrid.FacetGrid at 0x1a1d1e8e10>
  • For each reported date, aggregate all countries and draw a graph with two curves: one for new cases and one for new deaths
In [107]:
df_date = df_World.groupby('Date_reported').sum()[['New_cases','New_deaths']]
df_date.head()
Out[107]:
New_cases New_deaths
Date_reported
2020-01-11 00:00:00+00:00 41.0 1.0
2020-01-12 00:00:00+00:00 0.0 0.0
2020-01-13 00:00:00+00:00 1.0 0.0
2020-01-14 00:00:00+00:00 1.0 0.0
2020-01-15 00:00:00+00:00 0.0 0.0
  • Matplotlib
In [108]:
df_date.plot(title = 'New cases & New deaths over time for all over the world')
Out[108]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1831d3c8>
  • Plotly Express
In [109]:
# First, convert Date_reported to a string
# Otherwise Plotly Express will raise an error later

df_date = df_date.reset_index()
df_date['Date_reported'] = df_date['Date_reported'].astype('str').apply(lambda x : x[:10])
df_date.head()
Out[109]:
Date_reported New_cases New_deaths
0 2020-01-11 41.0 1.0
1 2020-01-12 0.0 0.0
2 2020-01-13 1.0 0.0
3 2020-01-14 1.0 0.0
4 2020-01-15 0.0 0.0
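Slicing the string representation works, but the .dt accessor states the intent more directly; a sketch on toy timestamps:

```python
import pandas as pd

# Two UTC timestamps standing in for Date_reported
df = pd.DataFrame({'Date_reported':
                   pd.to_datetime(['2020-01-11', '2020-01-12'], utc=True)})

# strftime renders each timestamp as a plain 'YYYY-MM-DD' string,
# which Plotly Express accepts as a categorical/date axis
df['Date_reported'] = df['Date_reported'].dt.strftime('%Y-%m-%d')
print(df['Date_reported'].tolist())  # ['2020-01-11', '2020-01-12']
```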
In [110]:
fig = px.line(df_date, x='Date_reported', y='New_cases', \
             labels={'y':'New_cases'},
             title='New cases over time for all over the world' )
fig.show()

fig = px.line(df_date, x='Date_reported', y='New_deaths', \
             labels={'y':'New_deaths'},
             title='New deaths over time for all over the world')
fig.show()

Interaction: explore any country you like by entering its name

In [111]:
country = input("Please input the country name:") 
df_country = df_World[df_World.Country == country].set_index('Date_reported')\
                     [['New_cases','New_deaths']]
df_country.plot( title = 'New cases & New deaths over time for ' + '%s' %country)
Please input the country name:United States of America
Out[111]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1d360b70>
In [112]:
country = input("Please input the country name:") 

fig = px.line(df_World[df_World.Country == country], x='Date_reported', y='New_cases', \
             labels={'y':'New_cases'},
             title='New cases over time for ' + '%s' %country)
fig.show()

fig = px.line(df_World[df_World.Country == country], x='Date_reported', y='New_deaths', \
             labels={'y':'New_deaths'},
             title='New deaths over time for ' + '%s' %country)
fig.show()
Please input the country name:United States of America
  • Considering all confirmed cases, draw a graph with one curve for cumulative cases in Mainland China and another for all countries outside China during January
In [113]:
df_China = df_World[df_World['Country'] == 'China'].set_index('Date_reported')['Cumulative_cases']
In [114]:
df_China_Jan = df_China[df_China.index < '2020-02-01']
df_China_Jan.head()
Out[114]:
Date_reported
2020-01-11 00:00:00+00:00    41.0
2020-01-12 00:00:00+00:00    41.0
2020-01-13 00:00:00+00:00    41.0
2020-01-14 00:00:00+00:00    41.0
2020-01-15 00:00:00+00:00    41.0
Name: Cumulative_cases, dtype: float64
In [115]:
df_nonChina = df_World[df_World['Country'] != 'China']
df_nonChina.head()
Out[115]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths
0 2020-02-24 00:00:00+00:00 AF Afghanistan EMRO 1.0 1.0 0.0 0.0
1 2020-02-25 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0
2 2020-02-26 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0
3 2020-02-27 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0
4 2020-02-28 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0
In [116]:
df_nonChina = df_nonChina.groupby('Date_reported')['Cumulative_cases'].sum()
df_nonChina.head()
Out[116]:
Date_reported
2020-01-13 00:00:00+00:00    1.0
2020-01-14 00:00:00+00:00    2.0
2020-01-15 00:00:00+00:00    2.0
2020-01-16 00:00:00+00:00    2.0
2020-01-17 00:00:00+00:00    3.0
Name: Cumulative_cases, dtype: float64
In [117]:
df_nonChina_Jan = df_nonChina[df_nonChina.index < '2020-02-01']
In [118]:
plt.plot(df_China_Jan ,'r')
plt.plot(df_nonChina_Jan ,'b')  
plt.legend(['Cumulative_cases_China','Cumulative_cases_NonChina']) # Add the legend
plt.show()
In [119]:
df.head() # Each row of this DataFrame is one country; there is no Date_reported column
Out[119]:
Country Country_code Cumulative_cases Cumulative_deaths WHO_region CFR Alpha-2 code Alpha-3 code
0 Afghanistan AF 27878.0 548.0 EMRO 0.020 AF AFG
1 Albania AL 1838.0 42.0 EURO 0.023 AL ALB
2 Algeria DZ 11504.0 825.0 AFRO 0.072 DZ DZA
3 Andorra AD 855.0 52.0 EURO 0.061 AD AND
4 Angola AO 166.0 8.0 AFRO 0.048 AO AGO
In [120]:
iso = df[['Country_code','Alpha-3 code']]
iso.head()
Out[120]:
Country_code Alpha-3 code
0 AF AFG
1 AL ALB
2 DZ DZA
3 AD AND
4 AO AGO
In [121]:
iso.isna().sum()
Out[121]:
Country_code    0
Alpha-3 code    0
dtype: int64
In [122]:
iso.shape
Out[122]:
(216, 2)
In [123]:
# change Country_code of df_World from TWN to TW, otherwise it will not be matched
df_World.replace('TWN', 'TW', inplace = True)
df_World.head()
Out[123]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths
0 2020-02-24 00:00:00+00:00 AF Afghanistan EMRO 1.0 1.0 0.0 0.0
1 2020-02-25 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0
2 2020-02-26 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0
3 2020-02-27 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0
4 2020-02-28 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0
In [124]:
# Merge df_World & iso dataframe 
df_plot = df_World.merge(iso, on = 'Country_code', how='left')
df_plot.head()
Out[124]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths Alpha-3 code
0 2020-02-24 00:00:00+00:00 AF Afghanistan EMRO 1.0 1.0 0.0 0.0 AFG
1 2020-02-25 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0 AFG
2 2020-02-26 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0 AFG
3 2020-02-27 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0 AFG
4 2020-02-28 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0 AFG
In [125]:
df_plot.isna().sum()
Out[125]:
Date_reported        0
Country_code         0
Country              0
WHO_region           0
New_cases            0
Cumulative_cases     0
New_deaths           0
Cumulative_deaths    0
Alpha-3 code         0
dtype: int64
In [126]:
# In Plotly Express, 'Date_reported' should be a string, otherwise it raises an error
df_plot['Date_reported'] = df_plot.Date_reported.astype('str').apply(lambda x : x[:10])
In [127]:
# Reorder the dataframe based on Date_reported
df_plot = df_plot.sort_values('Date_reported')
df_plot.head()
Out[127]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths Alpha-3 code
4337 2020-01-11 CN China WPRO 41.0 41.0 1.0 1.0 CHN
4338 2020-01-12 CN China WPRO 0.0 41.0 0.0 1.0 CHN
4339 2020-01-13 CN China WPRO 0.0 41.0 0.0 1.0 CHN
20584 2020-01-13 TH Thailand SEARO 1.0 1.0 0.0 0.0 THA
20585 2020-01-14 TH Thailand SEARO 0.0 1.0 0.0 0.0 THA
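Sorting on the string column is safe here only because ISO 'YYYY-MM-DD' strings sort lexicographically in the same order as the dates they encode:

```python
# ISO-formatted date strings: lexicographic order == chronological order
dates = ['2020-02-01', '2020-01-13', '2020-01-11']
print(sorted(dates))  # ['2020-01-11', '2020-01-13', '2020-02-01']
```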
In [128]:
# The closer the color is to dark blue, the more serious the situation relative to other countries
# China's color lightens over time, indicating its situation is improving relative to others

fig = px.choropleth(df_plot, locations="Alpha-3 code", color='Cumulative_cases'\
                    ,hover_name="Country",animation_frame="Date_reported"\
                    ,color_continuous_scale='Teal',template = 'plotly_white')
fig.show()
In [129]:
fig = px.choropleth(df_plot, locations="Alpha-3 code", color='Cumulative_deaths'\
                    ,hover_name="Country",animation_frame="Date_reported"\
                    ,color_continuous_scale='Reds',template = 'plotly_white')
fig.show()

Integrate Python & MySQL

Migrate the information stored in the 'df_World' DataFrame from Python to a SQL table

In [132]:
df_World.head()
Out[132]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths
0 2020-02-24 00:00:00+00:00 AF Afghanistan EMRO 1.0 1.0 0.0 0.0
1 2020-02-25 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0
2 2020-02-26 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0
3 2020-02-27 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0
4 2020-02-28 00:00:00+00:00 AF Afghanistan EMRO 0.0 1.0 0.0 0.0

Importing the package establishes the connection between Python and SQL, building, so to speak, a bridge between the two pieces of software. I managed to establish a connection between MySQL Workbench and Jupyter.

Note: All cells below should be executed only once

In [165]:
import pymysql
  • To move any data from Python to SQL, I need to designate the database to connect to. Beforehand, I created a schema/database called "dbWHO" in MySQL Workbench

  • I then created a table called "COVID19" in the dbWHO database

In [166]:
password = 'your_password_here'  # avoid hardcoding real credentials; prefer getpass or an environment variable
In [167]:
conn = pymysql.connect(database = 'dbWHO', user = 'root', password = password)
cursor = conn.cursor()
  • Executing SQL queries from Python
    Python variables can be passed as arguments to the .execute() method as well.

  • The query itself is written in SQL and passed to .execute() as a string.

In [168]:
# Running this cell returns 0, which means the table currently contains no data
cursor.execute('SELECT * FROM COVID19;')
Out[168]:
0
In [169]:
# Make sure any data stored in the current DataFrame to be consistent with SQL syntax
df_World.Date_reported = df_World.Date_reported.astype('str').apply(lambda x : x[:10])
df_World.Date_reported = df_World.Date_reported.str.replace('-','/')
df_World.head(3)
Out[169]:
Date_reported Country_code Country WHO_region New_cases Cumulative_cases New_deaths Cumulative_deaths
0 2020/02/24 AF Afghanistan EMRO 1.0 1.0 0.0 0.0
1 2020/02/25 AF Afghanistan EMRO 0.0 1.0 0.0 0.0
2 2020/02/26 AF Afghanistan EMRO 0.0 1.0 0.0 0.0
  • Create the query, store it as a string variable in Python, and pass it as an argument to the execute method; it is the execute method that moves the data into the designated table in Workbench.

  • The number of rows that can be inserted into a MySQL table with a single query may be limited. With a limit of twenty thousand rows, the insert cannot be completed all at once.

  • I therefore split the query into two parts, neither of which contains more than 20,000 records.
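An alternative worth noting: rather than hand-building one giant INSERT string, pymysql's executemany with %s placeholders lets the driver quote and escape each value, and a small helper can cap every batch at 20,000 rows. A sketch under those assumptions (the DataFrame below is a toy stand-in for df_World, and the commented lines assume the conn/cursor created earlier):

```python
import pandas as pd

def chunked(rows, size=20000):
    """Yield successive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

# Toy stand-in for df_World with hypothetical values
df = pd.DataFrame({'Country': ["Laos", "Lao People's Democratic Republic"],
                   'New_cases': [1, 2]})
rows = [tuple(r) for r in df.itertuples(index=False)]

# With a live connection, each chunk could be inserted like:
#   placeholders = ', '.join(['%s'] * df.shape[1])
#   for chunk in chunked(rows):
#       cursor.executemany('INSERT INTO COVID19 VALUES (%s)' % placeholders, chunk)
#   conn.commit()
# The driver escapes quotes, so apostrophes need no manual workaround.

print(sum(1 for _ in chunked(rows, size=1)))  # 2 chunks of one row each
```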

In [170]:
# Creating the INSERT Statement 
insert_query = 'INSERT INTO COVID19 VALUES '

for i in range(10000): # the first 10000 rows
    insert_query += '('

    # Inner loop: repeat for every column j
    for j in range(df_World.shape[1]): # 8 columns
        # Turn the value from each column into a string
        # and append it to the insert statement
        insert_query += "'" + str(df_World.iloc[i][j]) + "', "
    insert_query = insert_query[:-2] + '), '

insert_query = insert_query[:-2] + ';'
In [171]:
cursor.execute(insert_query)
Out[171]:
10000
  • When executing a SQL statement that changes the database state from Python, always commit the action
In [172]:
conn.commit()
In [173]:
# Run this cell before the next one: the apostrophe in "Lao People's Democratic Republic"
# would otherwise break the single-quoted values in the INSERT statement
df_World['Country'].replace(to_replace="Lao People's Democratic Republic",\
                                            value='Laos',inplace=True)
In [174]:
insert_query = 'INSERT INTO COVID19 VALUES '

for i in range(10000, df_World.shape[0]): # insert the remaining rows
    insert_query += '('

    # Inner loop: repeat for every column j
    for j in range(df_World.shape[1]): # 8 columns
        # Turn the value from each column into a string
        # and append it to the insert statement
        insert_query += "'" + str(df_World.iloc[i][j]) + "', "
    insert_query = insert_query[:-2] + '), '

insert_query = insert_query[:-2] + ';'
In [175]:
cursor.execute(insert_query)
Out[175]:
13035
In [176]:
conn.commit()
In [177]:
# Now 23035 records have been added to the COVID19 table
cursor.execute('SELECT * FROM COVID19;')
Out[177]:
23035
In [178]:
# Export to csv 
df_World.to_csv('COVID19.csv')